{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "6XvCUmCEd4Dm"
      },
      "source": [
        "# TensorFlow Datasets\n",
        "\n",
        "TFDS provides a collection of ready-to-use datasets for use with TensorFlow, Jax, and other Machine Learning frameworks.\n",
        "\n",
        "It handles downloading and preparing the data deterministically and constructing a `tf.data.Dataset` (or `np.array`).\n",
        "\n",
        "Note: Do not confuse [TFDS](https://www.tensorflow.org/datasets) (this library) with `tf.data` (TensorFlow API to build efficient data pipelines). TFDS is a high level wrapper around `tf.data`. If you're not familiar with this API, we encourage you to read [the official tf.data guide](https://www.tensorflow.org/guide/datasets) first.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "J8y9ZkLXmAZc"
      },
      "source": [
        "Copyright 2018 The TensorFlow Datasets Authors, Licensed under the Apache License, Version 2.0"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "OGw9EgE0tC0C"
      },
      "source": [
        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/datasets/overview\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/datasets/blob/master/docs/overview.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "\u003c/table\u003e"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_7hshda5eaGL"
      },
      "source": [
        "## Installation\n",
        "\n",
        "TFDS exists in two packages:\n",
        "\n",
        "* `tensorflow-datasets`: The stable version, released every few months.\n",
        "* `tfds-nightly`: Released every day, contains the last versions of the datasets.\n",
        "\n",
        "To install:\n",
        "\n",
        "```\n",
        "pip install tensorflow-datasets\n",
        "```\n",
        "\n",
        "Note: TFDS requires `tensorflow` (or `tensorflow-gpu`) to be already installed. TFDS support TF \u003e=1.15.\n",
        "\n",
        "This colab uses `tfds-nightly` and TF 2.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "boeZp0sYbO41"
      },
      "outputs": [],
      "source": [
        "!pip install -q tensorflow\u003e=2 tfds-nightly matplotlib"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "TTBSvHcSLBzc"
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import numpy as np\n",
        "import tensorflow as tf\n",
        "\n",
        "import tensorflow_datasets as tfds"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "VZZyuO13fPvk"
      },
      "source": [
        "## Find available datasets\n",
        "\n",
        "All dataset builders are subclass of `tfds.core.DatasetBuilder`. To get the list of available builders, uses `tfds.list_builders()` or look at our [catalog](https://www.tensorflow.org/datasets/catalog/overview)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "FAvbSVzjLCIb"
      },
      "outputs": [],
      "source": [
        "tfds.list_builders()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "VjI6VgOBf0v0"
      },
      "source": [
        "## Load a dataset\n",
        "\n",
        "The easiest way of loading a dataset is `tfds.load`. It will:\n",
        "\n",
        "1. Download the data and save it as [`tfrecord`](https://www.tensorflow.org/tutorials/load_data/tfrecord) files.\n",
        "2. Load the `tfrecord` and create the `tf.data.Dataset`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "dCou80mnLLPV"
      },
      "outputs": [],
      "source": [
        "ds = tfds.load('mnist', split='train', shuffle_files=True)\n",
        "assert isinstance(ds, tf.data.Dataset)\n",
        "print(ds)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "byOXYCEJS7S6"
      },
      "source": [
        "Some common arguments:\n",
        "\n",
        "*   `split=`: Which split to read (e.g. `'train'`, `['train', 'test']`, `'train[80%:]'`,...). See our [split API guide](https://www.tensorflow.org/datasets/splits).\n",
        "*   `shuffle_files=`: Control whether to shuffle the files between each epoch (TFDS store big datasets in multiple smaller files).\n",
        "*   `data_dir=`: Location where the dataset is saved (\n",
        "defaults to `~/tensorflow_datasets/`)\n",
        "*   `with_info=True`: Returns the `tfds.core.DatasetInfo` containing dataset metadata\n",
        "*   `download=False`: Disable download\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "qeNmFx_1RXCb"
      },
      "source": [
        "`tfds.load` is a thin wrapper around `tfds.core.DatasetBuilder`. You can get the same output using the `tfds.core.DatasetBuilder` API:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "2zN_jQ2ER40W"
      },
      "outputs": [],
      "source": [
        "builder = tfds.builder('mnist')\n",
        "# 1. Create the tfrecord files (no-op if already exists)\n",
        "builder.download_and_prepare()\n",
        "# 2. Load the `tf.data.Dataset`\n",
        "ds = builder.as_dataset(split='train', shuffle_files=True)\n",
        "print(ds)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "aW132I-rbJXE"
      },
      "source": [
        "## Iterate over a dataset\n",
        "\n",
        "### As dict\n",
        "\n",
        "By default, the `tf.data.Dataset` object contains a `dict` of `tf.Tensor`s:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "JAGjXdk_bIYQ"
      },
      "outputs": [],
      "source": [
        "ds = tfds.load('mnist', split='train')\n",
        "ds = ds.take(1)  # Only take a single example\n",
        "\n",
        "for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`\n",
        "  print(list(example.keys()))\n",
        "  image = example[\"image\"]\n",
        "  label = example[\"label\"]\n",
        "  print(image.shape, label)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "umAtqBBqdkDG"
      },
      "source": [
        "### As tuple\n",
        "\n",
        "By using `as_supervised=True`, you can get a tuple `(features, label)` instead for supervised datasets."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "nJ4O0xy3djfV"
      },
      "outputs": [],
      "source": [
        "ds = tfds.load('mnist', split='train', as_supervised=True)\n",
        "ds = ds.take(1)\n",
        "\n",
        "for image, label in ds:  # example is (image, label)\n",
        "  print(image.shape, label)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "u9palgyHfEwQ"
      },
      "source": [
        "### As numpy\n",
        "\n",
        "Uses `tfds.as_numpy` to convert:\n",
        "\n",
        "*   `tf.Tensor` -\u003e `np.array`\n",
        "*   `tf.data.Dataset` -\u003e `Generator[np.array]`\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "tzQTCUkAfe9R"
      },
      "outputs": [],
      "source": [
        "ds = tfds.load('mnist', split='train', as_supervised=True)\n",
        "ds = ds.take(1)\n",
        "\n",
        "for image, label in tfds.as_numpy(ds):\n",
        "  print(type(image), type(label), label)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "XaRN-LdXUkl_"
      },
      "source": [
        "### As batched tf.Tensor\n",
        "\n",
        "By using `batch_size=-1`, you can load the full dataset in a single batch.\n",
        "\n",
        "`tfds.load` will return a `dict` (`tuple` with `as_supervised=True`) of `tf.Tensor` (`np.array` with `tfds.as_numpy`).\n",
        "\n",
        "Be careful that your dataset can fit in memory, and that all examples have the same shape."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Gg8BNsv-UzFl"
      },
      "outputs": [],
      "source": [
        "image, label = tfds.as_numpy(tfds.load(\n",
        "    'mnist',\n",
        "    split='test', \n",
        "    batch_size=-1, \n",
        "    as_supervised=True,\n",
        "))\n",
        "\n",
        "print(type(image), image.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "o-cuwvVbeb43"
      },
      "source": [
        "### Build end-to-end pipeline\n",
        "\n",
        "To go further, you can look:\n",
        "\n",
        "*   Our [end-to-end Keras example](https://www.tensorflow.org/datasets/keras_example) to see a full training pipeline (with batching, shuffling,...).\n",
        "*   Our [performance guide](https://www.tensorflow.org/datasets/performances) to improve the speed of your pipelines.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gTRTEQqscxAE"
      },
      "source": [
        "## Visualize a dataset\n",
        "\n",
        "Visualize datasets with `tfds.show_examples` (only image datasets supported now):\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "DpE2FD56cSQR"
      },
      "outputs": [],
      "source": [
        "ds, info = tfds.load('mnist', split='train', with_info=True)\n",
        "\n",
        "fig = tfds.show_examples(ds, info)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Y0iVVStvk0oI"
      },
      "source": [
        "## Access the dataset metadata\n",
        "\n",
        "All builders include a `tfds.core.DatasetInfo` object containing the dataset metadata.\n",
        "\n",
        "It can be accessed through:\n",
        "\n",
        "*   The `tfds.load` API:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "UgLgtcd1ljzt"
      },
      "outputs": [],
      "source": [
        "ds, info = tfds.load('mnist', with_info=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "XodyqNXrlxTM"
      },
      "source": [
        "*   The `tfds.core.DatasetBuilder` API:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "nmq97QkilxeL"
      },
      "outputs": [],
      "source": [
        "builder = tfds.builder('mnist')\n",
        "info = builder.info"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "zMGOk_ZsmPeu"
      },
      "source": [
        "The dataset info contains additional informations about the dataset (version, citation, homepage, description,...)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "O-wLIKD-mZQT"
      },
      "outputs": [],
      "source": [
        "print(info)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "1zvAfRtwnAFk"
      },
      "source": [
        "### Features metadata (label names, image shape,...)\n",
        "\n",
        "Access the `tfds.features.FeatureDict`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "RcyZXncqoFab"
      },
      "outputs": [],
      "source": [
        "info.features"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "KAm9AV7loyw5"
      },
      "source": [
        "Number of classes, label names:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "HhfzBH6qowpz"
      },
      "outputs": [],
      "source": [
        "print(info.features[\"label\"].num_classes)\n",
        "print(info.features[\"label\"].names)\n",
        "print(info.features[\"label\"].int2str(7))  # Human readable version (8 -\u003e 'cat')\n",
        "print(info.features[\"label\"].str2int('7'))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "g5eWtk9ro_AK"
      },
      "source": [
        "Shapes, dtypes:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "SergV_wQowLY"
      },
      "outputs": [],
      "source": [
        "print(info.features.shape)\n",
        "print(info.features.dtype)\n",
        "print(info.features['image'].shape)\n",
        "print(info.features['image'].dtype)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "thMOZ4IKm55N"
      },
      "source": [
        "### Split metadata (e.g. split names, number of examples,...)\n",
        "\n",
        "Access the `tfds.core.SplitDict`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "FBbfwA8Sp4ax"
      },
      "outputs": [],
      "source": [
        "print(info.splits)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "EVw1UVYa2HgN"
      },
      "source": [
        "Available splits:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "fRBieOOquDzX"
      },
      "outputs": [],
      "source": [
        "print(list(info.splits.keys()))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "iHW0VfA0t3dO"
      },
      "source": [
        "Get info on individual split:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "-h_OSpRsqKpP"
      },
      "outputs": [],
      "source": [
        "print(info.splits['train'].num_examples)\n",
        "print(info.splits['train'].filenames)\n",
        "print(info.splits['train'].num_shards)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fWhSkHFNuLwW"
      },
      "source": [
        "It also works with the subsplit API:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "HO5irBZ3uIzQ"
      },
      "outputs": [],
      "source": [
        "print(info.splits['train[15%:75%]'].num_examples)\n",
        "print(info.splits['train[15%:75%]'].file_instructions)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "GmeeOokMODg2"
      },
      "source": [
        "## Citation\n",
        "\n",
        "If you're using `tensorflow-datasets` for a paper, please include the following citation, in addition to any citation specific to the used datasets (which can be found in the [dataset catalog](https://www.tensorflow.org/datasets/catalog)).\n",
        "\n",
        "```\n",
        "@misc{TFDS,\n",
        "  title = { {TensorFlow Datasets}, A collection of ready-to-use datasets},\n",
        "  howpublished = {\\url{https://www.tensorflow.org/datasets}},\n",
        "}\n",
        "```"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "tensorflow/datasets",
      "private_outputs": true,
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
